
Medical Language Model


Refine Medical Diagnosis Using Generation Augmented Retrieval and Clinical Practice Guidelines

Li, Wenhao, Zhang, Hongkuan, Zhang, Hongwei, Li, Zhengxu, Dong, Zengjie, Chen, Yafan, Bidargaddi, Niranjan, Liu, Hong

arXiv.org Artificial Intelligence

Current medical language models, adapted from large language models (LLMs), typically predict ICD code-based diagnoses from electronic health records (EHRs) because these labels are readily available. However, ICD codes do not capture the nuanced, context-rich reasoning clinicians use for diagnosis. Clinicians synthesize diverse patient data and reference clinical practice guidelines (CPGs) to make evidence-based decisions. This misalignment limits the clinical utility of existing models. We introduce GARMLE-G, a Generation-Augmented Retrieval framework that grounds medical language model outputs in authoritative CPGs. Unlike conventional Retrieval-Augmented Generation (RAG) approaches, GARMLE-G enables hallucination-free outputs by directly retrieving authoritative guideline content without relying on model-generated text. It (1) integrates LLM predictions with EHR data to create semantically rich queries, (2) retrieves relevant CPG knowledge snippets via embedding similarity, and (3) fuses guideline content with model output to generate clinically aligned recommendations. A prototype system for hypertension diagnosis was developed and evaluated on multiple metrics, demonstrating superior retrieval precision, semantic relevance, and clinical guideline adherence compared to RAG-based baselines, while maintaining a lightweight architecture suitable for localized healthcare deployment. This work provides a scalable, low-cost, and hallucination-free method for grounding medical language models in evidence-based clinical practice, with strong potential for broader clinical deployment.
The research reported in this paper is financially supported by the National Natural Science Foundation of China (62276156), the Shandong Provincial Natural Science Foundation (ZR2024LZH005), the Taishan Scholar Program of Shandong Province of China (No.tsqnz20240809), and the Excellent Youth Foundation of Shandong Natural Science Foundation (2024HWYQ-055).

Wenhao Li is with Shandong Normal University, Jinan, China, 250358 (email: lwh@sdnu.edu.cn). Hongkuan Zhang is with Shandong Normal University, Jinan, China, 250358 (email: 2024217028@stu.sdnu.edu.cn).

In the healthcare sector, language models and related tools, such as ChatGPT and ClinicalBERT, have been increasingly applied across multiple scenarios, including disease prediction, clinical decision support, patient interaction, drug discovery, and personalized medicine, significantly driving innovation and transformation in medical technology [1, 2]. As a fundamental task in healthcare, disease diagnosis refers to the process by which health professionals identify the most likely disease or disorder causing a patient's symptoms [3].
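The abstract's three-stage pipeline (query construction, embedding-similarity retrieval, guideline fusion) can be sketched in miniature. This is a toy illustration only: the `embed` bag-of-words helper stands in for whatever embedding model the paper actually uses, and the guideline snippets, EHR facts, and function names are all hypothetical.

```python
# Minimal sketch of the three GARMLE-G stages, with a toy bag-of-words
# embedding in place of a real sentence-embedding model. All snippet
# texts and helper names here are illustrative assumptions.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: bag-of-words token counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_query(llm_prediction: str, ehr_facts: list[str]) -> str:
    # Stage 1: fuse the model's tentative diagnosis with EHR data.
    return llm_prediction + " " + " ".join(ehr_facts)

def retrieve(query: str, guideline_snippets: list[str], k: int = 1) -> list[str]:
    # Stage 2: rank CPG snippets by embedding similarity to the query.
    q = embed(query)
    ranked = sorted(guideline_snippets, key=lambda s: cosine(q, embed(s)), reverse=True)
    return ranked[:k]

def recommend(llm_prediction: str, ehr_facts: list[str], snippets: list[str]) -> str:
    # Stage 3: return retrieved guideline text verbatim alongside the
    # prediction, so the guideline portion cannot be hallucinated.
    top = retrieve(build_query(llm_prediction, ehr_facts), snippets)
    return f"Prediction: {llm_prediction}\nGuideline: {top[0]}"

snippets = [
    "For stage 2 hypertension, initiate two first-line agents.",
    "Annual influenza vaccination is recommended for adults.",
]
print(recommend("stage 2 hypertension", ["BP 165/100", "age 58"], snippets))
```

Because the recommendation quotes the retrieved snippet verbatim rather than regenerating it, the guideline text is hallucination-free by construction, which is the key contrast with RAG drawn in the abstract.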


Leveraging Online Data to Enhance Medical Knowledge in a Small Persian Language Model

Ghassabi, Mehrdad, Rostami, Pedram, Kashani, Hamidreza Baradaran, Poursina, Amirhossein, Kazemi, Zahra, Tavakoli, Milad

arXiv.org Artificial Intelligence

The rapid advancement of language models has demonstrated the potential of artificial intelligence in the healthcare industry. However, small language models struggle with specialized domains in low-resource languages like Persian. While numerous medical-domain websites exist in Persian, no curated dataset or corpus has been available, making ours the first of its kind. This study introduces a newly curated dataset comprising 20k doctor-patient Q&A pairs and 60% of a 90-million-token crawled corpus from medical magazines. Using a parameter-efficient fine-tuning approach, we enhanced the medical knowledge of the baseline model, aya-expanse-8b. Benchmark evaluations demonstrate that the fine-tuned model achieves improved accuracy in medical question answering and successfully passed the Iranian Basic Medical Science Entrance Exam (IBSEE) in September 2023, which the baseline model did not. Additionally, the fine-tuned model improved Persian-translated MMLU accuracy by an average of 2.67%. This work highlights the potential of leveraging open-access online data to enrich small language models in medical fields, providing a novel solution for Persian medical AI applications suitable for resource-constrained environments. Future research could explore multimodal input to further enhance performance.
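The parameter-efficient fine-tuning the abstract refers to is typically LoRA-style adaptation: the pretrained weight matrix stays frozen and only a low-rank correction is trained. The sketch below shows the underlying arithmetic in plain Python; the matrix dimensions, rank, and scaling constant are illustrative assumptions, not the paper's actual configuration.

```python
# Illustrative sketch of the low-rank update behind parameter-efficient
# fine-tuning (LoRA): instead of retraining the full d x d weight matrix W,
# only two small matrices A (r x d) and B (d x r) are trained, and the
# adapted layer computes W x + (alpha / r) * B (A x).

def matvec(m, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]

def lora_forward(W, A, B, x, alpha=16, r=2):
    base = matvec(W, x)              # frozen pretrained projection
    delta = matvec(B, matvec(A, x))  # trainable low-rank correction
    return [b + (alpha / r) * d for b, d in zip(base, delta)]

# A 4-dim layer adapted with rank-2 matrices: 2*4 + 4*2 = 16 trainable
# values here; at 8B-parameter scale the relative saving is dramatic.
W = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
A = [[0.1, 0, 0, 0], [0, 0.1, 0, 0]]      # r x d
B = [[0.1, 0], [0, 0.1], [0, 0], [0, 0]]  # d x r
print(lora_forward(W, A, B, [1.0, 2.0, 3.0, 4.0]))
```

Training only A and B is what makes fine-tuning an 8B model feasible in the resource-constrained settings the abstract targets.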


Rethinking Evidence Hierarchies in Medical Language Benchmarks: A Critical Evaluation of HealthBench

Mutisya, Fred, Gitau, Shikoh, Ongoma, Nasubo, Mbae, Keith, Wamicha, Elizabeth

arXiv.org Artificial Intelligence

HealthBench, a benchmark designed to better measure the capabilities of AI systems for health (Arora et al., 2025), has advanced medical language model evaluation through physician-crafted dialogues and transparent rubrics. However, its reliance on expert opinion, rather than high-tier clinical evidence, risks codifying regional biases and individual clinician idiosyncrasies, further compounded by potential biases in automated grading systems. These limitations are particularly magnified in low- and middle-income settings, where issues like sparse neglected tropical disease coverage and region-specific guideline mismatches are prevalent. The unique challenges of the African context, including data scarcity, inadequate infrastructure, and nascent regulatory frameworks, underscore the urgent need for more globally relevant and equitable benchmarks. To address these shortcomings, we propose anchoring reward functions in version-controlled Clinical Practice Guidelines (CPGs) that incorporate systematic reviews and GRADE evidence ratings. Our roadmap outlines "evidence-robust" reinforcement learning via rubric-to-guideline linkage, evidence-weighted scoring, and contextual override logic, complemented by a focus on ethical considerations and the integration of delayed outcome feedback. By re-grounding rewards in rigorously vetted CPGs, while preserving HealthBench's transparency and physician engagement, we aim to foster medical language models that are not only linguistically polished but also clinically trustworthy, ethically sound, and globally relevant.
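The "evidence-weighted scoring" the roadmap proposes can be sketched as a rubric scorer in which each item is linked to a guideline recommendation carrying a GRADE certainty rating, so higher-certainty evidence contributes more to the reward. The weight values and the rubric structure below are illustrative assumptions, not the authors' specification.

```python
# Hedged sketch of evidence-weighted rubric scoring: each rubric item has
# a GRADE certainty level, and the reward is the fraction of
# evidence-weighted points earned. Weights are illustrative choices.
GRADE_WEIGHTS = {"high": 1.0, "moderate": 0.75, "low": 0.5, "very_low": 0.25}

def evidence_weighted_score(rubric_items):
    # rubric_items: list of (met: bool, grade: str, points: float)
    earned = sum(p * GRADE_WEIGHTS[g] for met, g, p in rubric_items if met)
    possible = sum(p * GRADE_WEIGHTS[g] for _, g, p in rubric_items)
    return earned / possible if possible else 0.0

items = [
    (True, "high", 2.0),      # met, backed by a systematic review
    (False, "high", 2.0),     # missed high-certainty recommendation
    (True, "very_low", 1.0),  # met, but expert opinion only
]
print(round(evidence_weighted_score(items), 3))
```

Under this weighting, missing a high-certainty recommendation costs far more than missing an expert-opinion item, which is exactly the re-ordering of incentives the abstract argues plain rubric scoring lacks.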


A Method for the Architecture of a Medical Vertical Large Language Model Based on Deepseek R1

Zhang, Mingda, Qin, Jianglong

arXiv.org Artificial Intelligence

Despite significant advances in foundation models like DeepSeek-R1 and ChatGPT, their deployment in medical settings faces critical challenges including computational requirements and professional knowledge barriers. This paper presents an efficient lightweight medical large language model architecture that systematically addresses these challenges through three-dimensional optimization: knowledge acquisition, model compression, and computational enhancement. We design a knowledge transfer pipeline from DeepSeek-R1-Distill-70B to DeepSeek-R1-Distill-7B using Low-Rank Adaptation (LoRA) for precise medical knowledge retention. Through 4-bit quantization and mixed-precision strategies, we achieve substantial model compression while preserving medical reasoning capabilities. The inference framework incorporates Flash Attention acceleration and continuous batching, complemented by specialized prompt templates for diverse medical queries. Experimental evaluation on medical benchmarks demonstrates that our approach maintains 92.1% accuracy on USMLE examinations while reducing memory consumption by 64.7% and inference latency by 12.4% compared to baseline models. This work provides a practical solution for deploying advanced language models in resource-constrained medical environments, enabling broader accessibility of AI-assisted healthcare.
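The 4-bit quantization step in the compression pipeline can be illustrated with a symmetric per-tensor scheme: weights are mapped to 16 integer levels with a shared scale, then dequantized at inference. This is a toy sketch of the general idea; production stacks (e.g. the NF4 scheme the paper's setup likely resembles) use grouped scales and fused kernels, and the weight values here are invented.

```python
# Toy sketch of symmetric 4-bit weight quantization: each weight maps to
# one of 16 integer levels (-8..7) via a per-tensor scale, cutting storage
# to 4 bits per weight at the cost of rounding error.
def quantize_4bit(weights):
    scale = max(abs(w) for w in weights) / 7.0 or 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Inference reconstructs approximate weights from the 4-bit codes.
    return [v * scale for v in q]

w = [0.42, -0.77, 0.05, 0.7]
q, s = quantize_4bit(w)
w_hat = dequantize(q, s)
print(q, [round(x, 3) for x in w_hat])
```

The gap between `w` and `w_hat` is the precision traded for the roughly 4x memory reduction that underlies the 64.7% memory saving reported above.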


LlamaCare: A Large Medical Language Model for Enhancing Healthcare Knowledge Sharing

Sun, Maojun

arXiv.org Artificial Intelligence

Large language models (LLMs) have shown impressive capabilities in knowledge memorization and presentation. However, when it comes to domain-specific knowledge and downstream tasks like medicine, general LLMs are often unable to give precise answers. In addition, when people want LLMs to answer classification questions, they usually go through instruction tuning first. However, LLMs do not always give a direct index of the categorization after instruction tuning. In this paper, we propose LlamaCare, a fine-tuned medical language model, and Extended Classification Integration (ECI), a module to handle classification problems of LLMs. Our contributions are: (i) We fine-tuned a large language model on medical knowledge with very low carbon emissions and achieved performance similar to ChatGPT using a 24 GB GPU. (ii) We solved the problem of redundant categorical answers and improved the performance of LLMs by proposing a new module called Extended Classification Integration. (iii) We released our processed data for one-shot and few-shot training on benchmarks such as PubMedQA and USMLE Steps 1-3. Our method achieves performance comparable to some state-of-the-art models with the same quantity of parameters on benchmarks, while being more environmentally friendly by using less GPU computation time. Our models, codes, and datasets can be found at \url{https://github.com/Stephen-SMJ/LLamaCare}.
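The problem ECI addresses, models producing verbose text instead of a direct category index, can be sketched as follows: score each candidate label and return the argmax, so the output is always a clean class index. The label set and scores below are hypothetical stand-ins for the LLM's label logits, and this sketch names only the general idea, not the paper's actual module.

```python
# Illustrative sketch of the idea behind Extended Classification
# Integration: rather than parsing free-form generations for a category,
# score each candidate label and return a direct (index, label) pair.
def classify(label_scores: dict[str, float]) -> tuple[int, str]:
    labels = list(label_scores)
    best = max(labels, key=lambda l: label_scores[l])
    return labels.index(best), best

# Hypothetical label logits for a PubMedQA-style yes/no/maybe question.
scores = {"yes": -1.2, "no": -0.4, "maybe": -2.9}
print(classify(scores))
```

Constraining the output to the label set removes the "redundant categorical answers" the abstract describes, since there is no free text left to parse.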


Addressing cognitive bias in medical language models

Schmidgall, Samuel, Harris, Carl, Essien, Ime, Olshvang, Daniel, Rahman, Tawsifur, Kim, Ji Woong, Ziaei, Rojin, Eshraghian, Jason, Abadir, Peter, Chellappa, Rama

arXiv.org Artificial Intelligence

There is increasing interest in the application of large language models (LLMs) to the medical field, in part because of their impressive performance on medical exam questions. While promising, exam questions do not reflect the complexity of real patient-doctor interactions. In reality, physicians' decisions are shaped by many complex factors, such as patient compliance, personal experience, ethical beliefs, and cognitive bias. Taking a step toward understanding this, our hypothesis posits that when LLMs are confronted with clinical questions containing cognitive biases, they will yield significantly less accurate responses compared to the same questions presented without such biases. In this study, we developed BiasMedQA, a benchmark for evaluating cognitive biases in LLMs applied to medical tasks. Using BiasMedQA we evaluated six LLMs, namely GPT-4, Mixtral-8x7B, GPT-3.5, PaLM-2, Llama 2 70B-chat, and the medically specialized PMC Llama 13B. We tested these models on 1,273 questions from the US Medical Licensing Exam (USMLE) Steps 1, 2, and 3, modified to replicate common clinically-relevant cognitive biases. Our analysis revealed varying effects of biases on these LLMs, with GPT-4 standing out for its resilience to bias, in contrast to Llama 2 70B-chat and PMC Llama 13B, which were disproportionately affected by cognitive bias. Our findings highlight the critical need for bias mitigation in the development of medical LLMs, pointing towards safer and more reliable applications in healthcare.
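The evaluation protocol described above, asking each USMLE-style item both verbatim and with a bias-inducing cue, then comparing per-condition accuracy, can be sketched as below. Here `ask_model` is a hypothetical stand-in for a real LLM call, and the bias prompt and question are invented examples, not items from BiasMedQA.

```python
# Hedged sketch of a BiasMedQA-style evaluation loop: accuracy is measured
# with and without a bias-inducing prefix, and the gap quantifies how
# susceptible the model is to that cognitive bias.
def ask_model(question: str) -> str:
    # Toy model that is swayed by a recency-bias cue, for illustration only.
    return "B" if "recently saw a similar patient" in question else "A"

RECENCY_BIAS = "You recently saw a similar patient who had condition B. "

def accuracy(items, bias_prefix=""):
    # items: list of (question, gold_answer) pairs.
    correct = sum(ask_model(bias_prefix + q) == gold for q, gold in items)
    return correct / len(items)

items = [("A 58-year-old presents with chest pain. Best diagnosis?", "A")]
print(accuracy(items), accuracy(items, RECENCY_BIAS))
```

A bias-resilient model (like GPT-4 in the study) would show little or no drop between the two conditions, whereas the toy model above flips its answer entirely.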